Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition
نویسندگان
چکیده
Machine Translation (MT) is now often used to produce approximate translations that are then corrected by trained professional post-editors. As a result, more and more datasets of post-edited translations are being collected. These datasets are very useful for training, adapting or testing existing MT systems. In this work, we present the design and content of one such corpus of post-edited translations, and consider less studied possible uses of these data, notably the development of an automatic Quality Estimation (QE) system and the detection of frequent errors in automatic translations. Both applications require a careful assessment of the variability in post-editions, that we study here.
منابع مشابه
A corpus of post-edited translations (Un corpus d'erreurs de traduction) [in French]
A corpus of post-edited translations More and more datasets of post-edited translations are being collected. These corpora have many applications, such as failure analysis of SMT systems and the development of quality estimation systems for SMT. This work presents a large corpus of post-edited translations that has been gathered during the ANR/TRACE project. Applications to the detection of fre...
متن کاملCollection of a Large Database of French-English SMT Output Corrections
Corpus-based approaches to machine translation (MT) rely on the availability of parallel corpora. To produce user-acceptable translation outputs, such systems need high quality data to be efficiently trained, optimized and evaluated. However, building high quality dataset is a relatively expensive task. In this paper, we describe the data collection and analysis of a large database of 10.881 SM...
متن کاملTowards a better understanding of statistical post-edition usefulness
We describe several experiments to better understand the usefulness of statistical post-edition (SPE) to improve phrasebased statistical MT (PBMT) systems raw outputs. Whatever the size of the training corpus, we show that SPE systems trained on general domain data offers no breakthrough to our baseline general domain PBMT system. However, using manually post-edited system outputs to train the ...
متن کاملSECTra_w.1: an Online Collaborative System for Evaluating, Post-editing and Presenting MT Translation Corpora
SECTra_w.1 is a web-oriented system mainly dedicated to the evaluation of MT systems. After importing a source corpus, and possibly reference translations, one can call various MT systems, store their results, and have a collection of human judges perform subjective evaluation online (fluidity, adequacy). It is also possible to perform objective, task-oriented evaluation by letting humans post-...
متن کاملA Corpus of Machine Translation Errors Extracted from Translation Students Exercises
In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyz...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013